Using Convex Pseudo-Data to Increase Prediction Accuracy

Author

  • Leo Breiman
Abstract

A prediction algorithm is consistent if, given a large enough sample of instances from the underlying distribution, it can achieve nearly optimal generalization accuracy. In practice, the training set is finite and does not give an adequate representation of the underlying distribution. Our work is based on a simple method for generating additional data from the existing data. Using this new data (convex pseudo-data), it is shown empirically that, on a variety of data sets, the prediction accuracy of an algorithm can be significantly improved. This is shown first in classification using the CART algorithm. Similar results are shown in regression. Then pseudo-data is applied to bagging CART. Although CART is used as a test bed, the idea of generating convex pseudo-data can be applied to any prediction method.

Given a training set T = {(y_n, x_n), n = 1, ..., N}, where the y are either class labels or numerical values and the x are M-dimensional input vectors, consider an algorithm that uses T to construct a predictor h(x) of future y-values given the input values x. If the algorithm is consistent (e.g., CART, C4.5, neural nets), then the larger the size N of the training set, the smaller the generalization error. One dream, assuming the training set consists of independent draws from the same underlying distribution P(dy,dx), is that we can keep drawing as long as we want, constructing a large training set. But generally, we are confined to a training set of given size. One way around this restriction is to try to manufacture new data from the given training set. Another approach is to use T to estimate P(dy,dx) (Shang and Breiman [1996b]). Using the estimated probability distribution, an infinite number of instances can be generated. This method gave promising results but ran into serious technical difficulties. Kernel density estimates were used for the numerical values, and ad hoc methods for categorical values. Numerous parameters had to be estimated to plug into the probability estimate, and the result was a complex and unwieldy procedure.

Our method for generating new data from the old is relatively simple and depends only on a single parameter d, 0 < d < 1. To create a new data instance, these steps are followed:
i) Select two instances (y,x), (y',x') at random from the training set.
ii) Select a random number v from the interval [0,d], and let u = 1 - v.
iii) The new instance is (y'',x''), where y'' = y, and …
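The three steps can be sketched in a few lines of Python. This is only an illustrative sketch, not the author's implementation. The excerpt is cut off before it states how x'' is formed, so the convex combination x'' = u*x + v*x' used below is an assumption suggested by the name "convex pseudo-data". Drawing the two instances with replacement and drawing v uniformly are further assumptions, and the function names `convex_pseudo_instance` and `convex_pseudo_data` are hypothetical.

```python
import random


def convex_pseudo_instance(xs, ys, d, rng):
    """Generate one pseudo-instance from inputs xs (lists of floats) and targets ys."""
    if not 0.0 < d < 1.0:
        raise ValueError("d must satisfy 0 < d < 1")
    # Step i: select two training instances at random
    # (ASSUMPTION: independently, with replacement).
    i = rng.randrange(len(xs))
    j = rng.randrange(len(xs))
    # Step ii: draw v from [0, d] (ASSUMPTION: uniformly) and set u = 1 - v.
    v = rng.uniform(0.0, d)
    u = 1.0 - v
    # Step iii: y'' = y, as stated in the text.
    # ASSUMPTION: x'' = u*x + v*x'; the source is truncated before defining x''.
    x_new = [u * a + v * b for a, b in zip(xs[i], xs[j])]
    return ys[i], x_new


def convex_pseudo_data(xs, ys, d, n_new, seed=0):
    """Generate n_new pseudo-instances as (y'', x'') pairs."""
    rng = random.Random(seed)
    return [convex_pseudo_instance(xs, ys, d, rng) for _ in range(n_new)]


if __name__ == "__main__":
    xs = [[0.0, 0.0], [1.0, 1.0]]
    ys = [0, 1]
    for y_new, x_new in convex_pseudo_data(xs, ys, d=0.3, n_new=5):
        print(y_new, x_new)
```

Under these assumptions, a small d keeps each pseudo-instance close to the instance whose label it inherits, while a d near 1 lets it drift most of the way toward the second instance.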

Similar resources

Real Time Pseudo-Range Correction Predicting by a Hybrid GASVM model in order to Improve RTDGPS Accuracy

A differential base station is sometimes unable to send correction information for minutes at a time, due to radio interference or loss of signal. To overcome the degradation caused by the loss of the Differential Global Positioning System (DGPS) Pseudo-Range Correction (PRC), the PRC can be predicted. In this paper, the Support Vector Machine (SVM) and Genetic Algorithms (GAs) will be incorpor...

Accuracy Improvement of Mood Disorders Prediction using a Combination of Data Mining and Meta-Heuristic Algorithms

Introduction: Since the delay or mistake in the diagnosis of mood disorders due to the similarity of their symptoms hinders effective treatment, this study aimed to accurately diagnose mood disorders including psychosis, autism, personality disorder, bipolar, depression, and schizophrenia, through modeling and analyzing patients' data. Method: Data collected in this applied developmental resear...

Prediction of Protein Sub-Mitochondria Locations Using Protein Interaction Networks

Background: Prediction of protein localization is among the most important issues in bioinformatics and is used for the prediction of the proteins in cells and organelles such as mitochondria. In this study, several machine learning algorithms are applied for the prediction of the intracellular protein locations. These algorithms use the features extracted from pro...

Prediction of Body Center of Mass Acceleration From Trunk and Lower Limb Joints Accelerations During Quiet Standing

Purpose: Body Center of Mass (COM) acceleration is predicted more accurately from the acceleration of three lower-limb joints than from the hip and ankle joints alone. Given that trunk movement during quiet standing is noticeable, including trunk acceleration in the model might increase the prediction accuracy of COM acceleration. Moreover, in previous research studi...

Publication date: 1998